Hydro-chemical based assessment of groundwater vulnerability in the Holocene multi-aquifers of Ganges delta

Determining the degree of high groundwater arsenic (As) and fluoride (F−) risk is crucial for successful groundwater management and protection of public health, as elevated contamination in groundwater poses a risk to the environment and human health. It is a fact that several non-point sources of pollutants contaminate the groundwater of the multi-aquifers of the Ganges delta. This study used logistic regression (LR), random forest (RF) and artificial neural network (ANN) machine learning algorithm to evaluate groundwater vulnerability in the Holocene multi-layered aquifers of Ganges delta, which is part of the Indo-Bangladesh region. Fifteen hydro-chemical data were used for modelling purposes and sophisticated statistical tests were carried out to check the dataset regarding their dependent relationships. ANN performed best with an AUC of 0.902 in the validation dataset and prepared a groundwater vulnerability map accordingly. The spatial distribution of the vulnerability map indicates that eastern and some isolated south-eastern and central middle portions are very vulnerable in terms of As and F− concentration. The overall prediction demonstrates that 29% of the areal coverage of the Ganges delta is very vulnerable to As and F− contents. Finally, this study discusses major contamination categories, rising security issues, and problems related to groundwater quality globally. Henceforth, groundwater quality monitoring must be significantly improved to successfully detect and reduce hazards to groundwater from past, present, and future contamination.


Study area
The Ganges and Brahmaputra delta, known as the Ganges delta, is one of the mega-deltas in the world, covering an area of approximately 105,000 km 2 .It consists of Bangladesh and parts of India's state of West Bengal, formed by sedimentation of the Ganga, Meghna and Brahmaputra rivers at the Bay of Bengal during the late Holocene to recent times 20 .The delta stretches from 21° 10′ 42″ to 24° 50′ 39″ N latitude and 87° 30′ 21″ to 91° 26′ 46″ E longitude (Fig. 1) and has a shoreline of nearly 350 km along the Bay of Bengal.The Ganges delta has been divided into three parts from a geological perspective i.e., "Moribund delta, Active delta and Mature delta" 34 .The delta's stratigraphic section shows alternating sand-dominated and fine-grained phases with intricate interfingerings between them 33 .This delta enclosed by "Precambrian crystalline rocks" to the north and west and the "Assam-Arakan Neogene fold belt" to the east, signifies a comprehensive sedimentation history during the late Quaternary period 35 .Literatures indicates that numerous elevated terraces from the Pleistocene era are present both within and along the periphery of its alluvial plain 36 .Present evidence considering remote sensing supporting neotectonics activities in the Gangetic plain 37 .Salinity has impacted aquifers in the coastal regions of Bangladesh, reaching depths of up to 350 m, furthermore the salinity levels in the upper aquifers of the coastal region, reaching depths of 200-250 m, demonstrate notable fluctuations and experience abrupt changes over short distances 38 .The monsoon season (June-October) accounts for more than 80% of the annual rainfall, which ranges from 1500 to 2000 mm 39 .During the monsoon months, high rainfall and frequent tropical cyclones cause catastrophic flooding and saltwater intrusion in the land areas.The minimum seasonal temperature of the region varies from 12 to 24 °C, and the maximum ranges from 25 to 35 °C.The area has the largest population density compared to other deltaic regions due to the high soil fertility 30 .The Sundarbans, the world's largest mangrove forest, covers the southernmost part of this deltaic region also known as the Sunderban delta.Borehole data indicates that sediment primarily consists of sand and clay types.

Methodology
In this study, the following methodological steps have been followed to fulfill the current research objectives: • In the initial stage, 352 water samples were collected from the existing tube-wells in the field to assess different hydro-geochemical properties.Additionally, 352 non-sample points were created for modelling purposes.Furthermore, it is necessary to divide the dataset into train and test to assess how well machine learning models are performed.Where, train dataset is used to fit the model and the test dataset is used for validation of the respective model.The entire dataset was split into two categories in a "70:30 ratio for training and validation" of the respective models.• A total of fifteen hydro-geochemical parameters were identified for modelling groundwater vulnerability.
The following sub-section described in details regarding the methods used in this study.

Sampling and inventory dataset
Field-based water sample collection was the primary task to prepare several hydro-chemical parameters for assessing groundwater vulnerability.In this regard, a "random stratified" sampling method was used to collect water sample across the study region.A total of 352 water samples were collected to prepare the inventory dataset (Fig. 1).Standard procedures were followed during the collection of water samples.Sampling was done by running wells for 5 min as it removes the stagnant water from bore wells as well as hand pumps.The sample tube well was kept pumping until the pH and EC achieved stable conditions.Two independent (dry and clean) sample kits were used each with its own collection methods and safety measures, to keep the water samples that were taken.In order to transport each water sample from the field to the lab and keep it at 4 °C, we stored it in a water sample kit during sample collection.Measurements were made to analyze the groundwater samples obtained both on-site and off-site.The analyzed samples were split into two categories on the ArcGIS 10.4.1 platform based on a ratio of 70:30.The sample was used for training (70%) while the other was used for validation (30%).
In our current research, we opted for dry season (March-early June) data to model and map groundwater vulnerability in this susceptible region, excluding wet season data.Existing literature indicates a prevalent use of dry season data in studies related to arsenic-induced vulnerability studies 40 , as it is deemed more suitable for assessing vulnerability to arsenic-related risks.In the wet season, groundwater contamination occurs through the percolation and infiltration of surface water, facilitated by ample rainfall.This leads to the transfer of various particles, metals, and ions from surface water bodies to groundwater, resulting in temporary water contamination, which is not ideal for assessing water-related health hazards.In contrast, during the dry season, water levels remain normal, and there is no risk of water contamination through surface metals or other substances.Therefore, based on these considerations, we have exclusively utilized dry season data in our study.

MC test
To ensure the accuracy of the model's output, it is crucial to select appropriate parameters for any vulnerability assessment.To achieve this, MC analysis is one of the most important techniques.Correlation analysis has shown that a link between two or more input variables can create deviations."Tolerance (TOL) and Variance Inflation Factor (VIF)" are two statistical measures often used to test multi-collinearity among distinct components.The predictor variables have a high degree of multicollinearity when the "TOL value is < 0.10 and the VIF value is > 5".If the MC result exceeds this limit, the highly correlated factors are not suitable for modelling purposes and should be removed from the dataset; otherwise, the output result will not be optimal.The equations for TOL and VIF are presented below: where R 2 j is the R-squared value of regression using the j on all other variables regression model.

Adopted methods for groundwater vulnerability modelling
LR One can create a multivariate regression relationship between a dependent variable and several independent factors using LR.LR is a multivariate analysis model that can be used to forecast the existence or absence of a characteristic or result based on the values of many response variables.Many studies used LR as a standard or conventional way to verify the effectiveness of a new algorithm in vulnerability studies.The benefit of LR is that, unlike traditional linear regression, where the variables must all have normal distributions, it can use any combination of continuous and discrete variables as well as appropriate link functions 41 .The challenge in conducting vulnerability analysis using a LR model is choosing the appropriate sample size for the dependent and independent variables 42 .The components in multi-regression analysis must be numerical, and the variables in discriminant analysis, a related statistical model, should have a normal distribution.After converting the dependent variable into a logit variable, the LR procedure uses maximum likelihood estimation 43 .This is how LR calculates the likelihood of a specific event occurring 44 .The fundamental idea behind LR is investigating a problem in which a result assessed using dichotomous variables i.e., true or false (0 and 1) is determined based on a single or a series of independent factors 45 .The LR can expressed by the following equation: where z indicates a linear combination of a constant and the independent variables' product, and their cor- responding coefficients.The value of z varies from − ∞ to ∞, subsequently f(z) ranges from 0 to 1":

RF
The RF model is a reliable AI method for classifying various natural hazards, including groundwater vulnerability.Breiman 46 proposed a potent ensemble-learning method called random forest, which is one of the most widely used classifier ensemble techniques for feature selection, regression, and classification applications.RF is a tree-based ensemble learning technique that builds several decision trees while constructing models.Each tree structure in the ensemble model uses the original input data to train a bootstrapped sample 47 .Decision trees use a collection of binary rules to select a target variable.The data used to train the model comprises the target variable being predicted and a set of predictor variables.Using the predictor variables, the decision tree divides the data into homogenous datasets based on the target variable.The programme then assesses each predictor variable's ability to categorize the predicted value into the two groups.The splitting process continues until there are no more splits to be made 48 .RF prediction is viewed as the unweighted majority of class votes when solving classification issues.The bagging approach is used to select random samples of variables as part of the training dataset for model calibration 49 .The algorithm for RF is expressed as follows: where i k represents flood occurrence conditioning factors; 1, 2,…n are input vector x.
In a RF the general errors can be defined as follows: where x and y indicate the different flood occurrence conditioning factors, and mg represents the margin function.Again, margin function" can be described as follows

ANN
The ANN is a computational method that can obtain, display, and compute mapping from one multivariate data space to another.The objective of the ANN model is to provide a technique for forecasting results from inputs that have not been used in the modelling process 50 .An artificial neural network is trained using a series of examples of related input and output values.The goal of an artificial neural network is to create a model of the data-generation process in order to generalize and predict outcomes from inputs that it has never seen before.Back-propagation learning is the neural network approach that is most often utilized in the ANN model 51 .This neural network learning technique has three levels: an input layer, hidden layers, and an output layer.The network is trained using the back-propagation technique until a predetermined minimal error between the network's desired and actual output values is reached.When training is complete, the network is utilized as a feed-forward structure to provide a classification for the entire database (Paola and Schowengerdt 52 ).The ANN assigns each input element a specific weight, multiplies the results, adds them up, and then uses a nonlinear transfer function to construct the outcomes.The back propagation of the ANN model is expressed by the following equations: The net input of jth neuron of layer l and I iteration δ Factor for neuron jth in the output layer ith δ factor for neuron jth in the hidden layer ith where α is the momentum rate and n is the learning rate within this model.www.nature.com/scientificreports/

Selected evaluation measures
Evaluating a model's performance, which establishes whether it is relevant or not, is one of the key goals of model comparison.In the geoscientific discipline, assessment metrics for applied models are crucial to estimating their best-case performance in making predictions, especially for modelling approaches based on machine learning.Henceforth, several evaluation measures have been used by many researchers in different fields of study to optimally assess the modelling output 14,24,53,54 .After a rigorous literature survey, five prevalent evaluations metrics i.e., "sensitivity, specificity, PPV, NPV, ROC-AUC, Kappa-coefficient and F-score", were selected for this study.Alongside, the Taylor diagram is also applied in this study, which is a graphical representation of evaluation measures expressing the relationship.A useful tool for displaying and assessing classifiers is the "receiver operating characteristics (ROC) curve" the common name for a performance indicator for classification problems at different threshold levels is the AUC-ROC curve.The ROC curve, which is a graph based on the true positive rate (sensitivity) and the false positive rate (1-specificity), may be thought of as a statistic that measures how well the model performed overall 55 .The AUC-ROC value ranged from 0 to 1 and indicates a poor and good performance accordingly 56 .The following formulas were used to create the performance evaluation criteria for this study: Here, "TP is true positive, TN is true negative, FN is false negative, FP is false positive, and kappa coefficient is represented by k, observed samples by P o and predicted result by P e ".

Result Statistical measures of selected hydrochemical parameters
In this study, three statistical tests were conducted on the selected hydrochemical dataset: MC, correlation coefficient, and PCA.The MC test (Table 1) showed that all factors were within the threshold value of MC, and therefore suitable for modelling purposes.,The depth factor had the highest TOL and lowest VIF (0.66 and 1.515 respectively), while the Ca 2+ factor had the lowest TOL and highest VIF (0.38 and 0.632 respectively).Pearson's correlation coefficient was used to understand the nature of the substantial association between physical and chemical properties.The correlation coefficient (r) ranges from − 1 to + 1, with values of 0.5, 0.5-0.8 and 0.8 indicating weak, moderate, and strongly correlation, respectively.The highest correlation values were found between pH and K + (0.952) and EC and CI (0.973), while moderate relationships were found between pH and salinity (0.546), pH and Mg 2+ (0.506), EC and Na + (0.644), Mg 2+ and K + (0.593), Na + and CI (0.613), and the lowest values were found between Ca 2+ and Mg 2+ (0.422), EC and Ca 2+ (0.365), depth and HCO 3 − (0.359), etc.Details about the correlation coefficient map and table are presented in Fig. 2 and Table 2. PCA analysis showed that PC 1 consisted of 43.21% eigenvalue, followed by PC 2 and PC 3, which had 31.02% and 17.08% eigenvalue, respectively.In PC 1, the dominant factors were EC (0.933), salinity (0.927), Mg 2+ (0.874) and CI (0.924), while important factors in PC 2 important factors were F − (0.765),As (0.599) and HCO 3-(0.582) and in PC 3 dominant factors were PO 4  2− (0.620), NO 3 (0.582) and K + (0.339).The biplot map of PC 1, PC 2 and PC 3 is presented in Fig. 3.

Assessment of groundwater vulnerability
Groundwater vulnerability in the aquifers of Ganges delta was assessed using LR, RF and ANN models, and the results are presented in Fig. 4. Statistical, ML, and neural network algorithms were used to understand the spatial distribution of groundwater vulnerability in the vulnerable mega-delta region.We used ArcGIS 10.5 software     to map the final spatial distribution of vulnerability using the respective modelling outcomes.Each map was classified into five vulnerability zones namely "very low, low, moderate, high and very high" using "Jenk's natural break method".The final vulnerability maps show that very high groundwater vulnerability zones are found in the eastern and some isolated south-eastern and central middle portions.Conversely, very low groundwater vulnerability zones are found in the north-western, eastern, and south-western parts.The moderate vulnerability zone is found in the central part and isolated patches of the south-eastern and southern parts of the study area.Due to the high concentration of As and other contaminated factors in the groundwater, the eastern part of the Ganges delta, i.e., the region of Bangladesh, is very vulnerable to groundwater compared to the western part of the delta region, i.e., the state of West Bengal in India.Although two isolated patches are found to be in the very high vulnerable zone in the western region of the Ganges delta i.e., part of India in RF and ANN models (Fig. 4).

Importance hydrochemical parameters for groundwater vulnerability
It is a fact that all selected hydrochemical parameters have not equal responsibility for groundwater vulnerability assessment in this study.Therefore, it is fundamental to determine the dominant factors in each applied learning model for groundwater vulnerability.The most dominant factors were identified for the three applied models, i.e., LR, RF, and ANN.The results of the dominant factors for groundwater vulnerability are presented in Table 3 for the three applied models.Factors such as F − (0.74), Na + (0.77),As (0.69), Mg 2+ (0.58) and HCO 3 (0.54) are more dominant, while SO 4 2 (0.2), pH (0.21), EC (0.31) and PO 4 2 (0.32) are less dominant in the LR model.The "mean decrease accuracy (MDA) method of RF algorithm" revealed that Na + (0.84), F − (0.77) and As (0.72) are the most influential factors on groundwater resources followed by HCO 3 (0.55), and Mg 2+ (0.54).In ANN, the

Quality assessment of groundwater
Piper, USSL, and Wilcox diagrams were used to assess the hydrochemical properties and quality of groundwater in the Ganges delta region.The Piper diagram (Fig. 6a) showed that alkaline earth (Ca 2+ +Mg 2+ ) dominates over alkalies (Na+K) and that CI and NO 3 dominate over HCO 3 .The Wilcox diagram (Fig. 6b) showed that two samples were unsuitable, while the others fell into the doubtful to unsuitable, permissible to doubtful, good to permissible and excellent categories.The USSL diagram (Fig. 6c) revealed a high salinity, low sodium and alkali hazard dominance.The collected datasets were grouped and analyzed using a hierarchical clustering method, which showed that the second cluster, and to a lesser extent, the first cluster, significantly influenced the state and the groundwater quality.The dendrogram (Fig. 6d) showed that the first cluster covered approximately 32% of the datasets, the second cluster covered the maximum dataset (47%) and the third cluster covered the lowest dataset (21%).

Discussion
The fundamental reasons for spatio-temporal fluctuations in the groundwater supply are increasing water demand across all sectors and changing climatic conditions 57 .These factors present a significant challenge to water resource planners.This study has demonstrated that considerable concentrations of elevated arsenic and nitrate in groundwater, as well as salinization, are among the groundwater quality issues in the coastal areas of the multi-aquifers of the Ganges delta.The quality of groundwater along the coast primarily depends on geological conditions, hydrogeological processes, and chemical activities 58 .Therefore, a trustworthy assessment of groundwater vulnerability is a crucial first step in choosing the best design or framework for future water resource development.
It can be challenging to select a preferred model for assessing inherent vulnerability that can effectively match the research topic's features and the study area's geo-environmental characteristics.The literature reveals that many academics have compared two or more vulnerability indices to create a meticulously tailored intrinsic vulnerability model for their research, aiming to achieve optimal output 59,60 .In the current scenario, statistical and machine learning (ML) algorithms are widely employed in groundwater-related studies worldwide.For example, Yu et al. (2022)Vu 61 applied an integrated Variable Weight Model (VWM) and DRASTIC model to assess groundwater vulnerability in China and found that the VWM-DRASTIC combination provided optimal predictive analysis.Vu et al. 62 used a numerical model and the index-overlay method in conjunction with climate scenarios (RCPs) to evaluate groundwater vulnerability and associated sustainability in Taiwan, and they recommended optimal predictive analysis.Furthermore, several machine learning models have been utilized in various groundwater-related studies, including groundwater vulnerability 4,14,18,63 , nitrate concentration in groundwater [64][65][66] , and more.The random forest (RF) model is well-known for its numerous advantages and has been employed in various geoscientific fields, including groundwater vulnerability studies.Lahjouj et al. 67 utilized the RF algorithm in a survey of groundwater vulnerability to nitrate concentration in Morocco and achieved an accuracy assessment of 0.822 in terms of AUC-ROC.Similarly, Saha et al. 18 used RF to assess hydrochemicalbased groundwater vulnerability in parts of the Ganges delta and achieved optimal accuracy rates of 0.849 and 0.812 in the training and validation data of the ROC.Various statistical techniques are available, ranging from straightforward descriptive statistics of concentrations of specific contaminants to more complex regression analyses that consider the impacts of multiple predictor variables 68 .Binary logistic regression, sometimes known as logistic regression (LR), is a frequently used statistical technique for estimating groundwater vulnerability.LR models relate the potential influencing factors to the likelihood that a pollutant concentration will exceed a threshold value.Mohammaddost et al. 69 employed DRASTIC, EBF, and LR models in the Kabul basin of Afghanistan to assess groundwater vulnerability, and they found that LR provided 66% accuracy in AUC-ROC prediction analysis.Adiat et al. 70 applied LR for the same assessment in the Ilesa gold mining area of Nigeria, achieving an 85.7% accuracy in model prediction.Recently, with the significant advantages of neural network algorithms, several neural network models have also been used in groundwater studies.For instance, Elzain et al. 27 used the DLNN model in aquifer vulnerability studies in South Korea, while Elzain et al. 71 employed the RBNN model to assess groundwater vulnerability to nitrate contamination in the southern part of Korea.
Based on the discussion above and considering the significant advantages of statistical, machine learning (ML), and neural network algorithms, three popular learning algorithms, namely logistic regression (LR), random forest (RF), and artificial neural network (ANN), were selected for the optimal assessment of groundwater vulnerability in the mega delta of the Ganges delta, taking into account field-based hydrochemical parameters.The findings of this study demonstrate that among the applied models, ANN yields the most optimal results, Henceforth, studies on groundwater vulnerability serve as crucial measurements for the sustainable management of water resources, environmental preservation, and the guarantee of a secure and uncontaminated drinking water supply for both present and future generations.
Nonetheless, it is a fact that employing combined techniques and methodologies can aid in resolving ambiguities related to GIS-based vulnerability assessment frameworks in geoscientific fields.The approaches presented in this research can be tested in various hydrogeological and geo-environmental contexts to understand the spatial distribution of vulnerability.Evaluating groundwater vulnerability studies requires careful consideration of the data and tools used for validation.Furthermore, the limitations of this study are not considered various important factors, such as the hydrogeological process of groundwater, land use land cover, and aquifer and soil characteristics, as all of these factors affect groundwater quality.In the future, other neural networks and deep learning algorithms can be beneficial for the optimal assessment of groundwater vulnerability in the mega-delta, www.nature.com/scientificreports/considering changing climate and land use land cover.Therefore, the results of this study will be valuable to land use planners and provide fundamental information for the optimal assessment and management of groundwater risk zones accordingly.

Conclusion
Globally, assessing susceptibility to groundwater contamination is crucial for proactive management aimed at safeguarding groundwater resources for various uses.Creating more effective sustainable development policies regarding potential groundwater pollution by utilizing more precise vulnerability maps.In the Ganges deltaic region, the high concentrations of contaminants, such as arsenic (As), are primarily responsible for groundwater vulnerability, and the associated human health hazards are a significant concern for global researchers.In the present research, there is a focus on creating an effective vulnerability map for a mega-delta, specifically the Ganges delta.This involves the application of LR, RF, and ANN models in the modelling and mapping process.Sensitivity analysis indicates that the ANN output is the most optimal, followed by RF and LR.The study reveals that the neural network algorithm is the best suited for assessing groundwater vulnerability related to contamination in the study region, surpassing traditional statistical analysis.Hydrochemical parameters such as pH, NO 3 − , As, and K + dominate this deltaic aquifer, contributing to vulnerability.Overall, all vulnerability maps indicate that the study area's western, central, south, and eastern parts are highly vulnerable.Due to elevated levels of As and various ion contaminations, most groundwater samples from the Ganges delta are unsuitable for drinking and irrigation.Consequently, the improper implementation of government policies, a lack of awareness, and inadequate management are the primary concerns leading to groundwater deterioration in this region.Therefore, immediate action is necessary to sustain and conserve groundwater resources in the world's largest and most densely populated deltaic region.Henceforth, in future application of deep learning and both the dataset i.e., dry and wet season for sampling procedure will be helpful for better understanding of groundwater vulnerability in this vulnerable region.

Figure 1 .
Figure 1.Details about the study area: (a) Ganges delta in a transnational boundary of Indo-Bangladesh region, (b) Ganges delta and its morphological types and (c) litholog profile of some selected points (this map was generated using ArcGIS, version: 10.3.1, www.esri.com/ arcgis).

Figure 5 .
Figure 5. Graphical evaluation measure of applied models using Taylor diagram.

Table 1 .
Multi-collinearity analysis of selected factors.

Table 3 .
Variables importance of selected factors through applied three models.

Table 4 .
Evaluation metrices of applied models.